# Pandas: DataFrames for Python

Python is a general-purpose language. It doesn't have to beat a specialized language at its own specialty; it just needs a good enough library, because it is already better at all the other parts, like dealing with files, building CLIs and GUIs, etc.


DataFrames (well known from R) are like Excel spreadsheets in Python. (In fact, pandas can open Excel files.) They are for _structured data_. If a NumPy axis has a meaning you want to assign a name to, your data is probably structured.
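
To make the "named axis" idea concrete, here is a small sketch. The array contents and the column names `time` and `temperature` are made up for illustration; the point is only that the DataFrame carries the meaning of each column along with the data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: one row per sample, one column per quantity.
raw = np.array([[1.0, 20.5], [2.0, 21.0], [3.0, 19.5]])

# In plain NumPy, you have to remember that column 1 holds the temperature.
temps_numpy = raw[:, 1]

# In a DataFrame, the axis has names, so the meaning travels with the data.
frame = pd.DataFrame(raw, columns=["time", "temperature"])
temps_pandas = frame["temperature"]
```

Both selections hold the same values; the second one just says what it is.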

In [None]:
import pandas as pd

We could make a DataFrame by hand, but most of the time you'll load one from a data source. So let's make a CSV:

In [None]:
%%writefile tmp.csv
id,                version, os,    arch
cp37-macos_arm64,  3.7,     macos, arm64
cp38-macos_arm64,  3.8,     macos, arm64
cp39-macos_arm64,  3.9,     macos, arm64
cp37-macos_x86_64, 3.7,     macos, x86_64
cp38-macos_x86_64, 3.8,     macos, x86_64
cp39-macos_x86_64, 3.9,     macos, x86_64

Pandas can read this with its default settings, and even formats the result nicely for your screen:

In [None]:
pd.read_csv("tmp.csv")

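There are lots of powerful options for reading data and for cleaning it up later; let's do a better job of importing. Here we use the first column as the index, skip the padding spaces after each comma, and store `os` and `arch` as categories: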

In [None]:
df = pd.read_csv(
    "tmp.csv",
    index_col=0,
    skipinitialspace=True,
    dtype={"os": "category", "arch": "category"},
)
df

In [None]:
df.info()
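
To see why `skipinitialspace=True` matters for a padded file like this one, here is a minimal sketch that reads the same text with and without it. It uses an in-memory `io.StringIO` and a shortened, made-up copy of the CSV instead of `tmp.csv`, purely to keep the example self-contained.

```python
import io

import pandas as pd

# A shortened stand-in for tmp.csv, with a space after each comma.
csv_text = "id, version, os\ncp37-macos_arm64, 3.7, macos\n"

# Without the option, the padding becomes part of the column names.
plain = pd.read_csv(io.StringIO(csv_text))
plain_columns = list(plain.columns)

# With the option, whitespace after each delimiter is dropped.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
clean_columns = list(clean.columns)
```

In the first version, you would have to write `plain[" os"]` (with the leading space) to get at the column, which is easy to miss.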

We can select a column by name:

In [None]:
df["os"]

If a column name is a valid Python identifier (and doesn't clash with an existing DataFrame attribute), it is also available as an attribute:

In [None]:
df.arch

You have quick, easy access to lots of analysis tools:

In [None]:
df.version.plot.bar();

You can select using a variety of methods, including NumPy style boolean arrays:

In [None]:
df[df.arch == "arm64"]
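
Boolean masks can also be combined, and `.loc` accepts a mask together with a column name. The sketch below rebuilds the same table by hand so that it runs on its own; the variable names are just for illustration.

```python
import pandas as pd

# The same data as tmp.csv, built by hand so this cell stands alone.
df = pd.DataFrame(
    {
        "version": [3.7, 3.8, 3.9, 3.7, 3.8, 3.9],
        "os": ["macos"] * 6,
        "arch": ["arm64"] * 3 + ["x86_64"] * 3,
    },
    index=pd.Index(
        [
            "cp37-macos_arm64",
            "cp38-macos_arm64",
            "cp39-macos_arm64",
            "cp37-macos_x86_64",
            "cp38-macos_x86_64",
            "cp39-macos_x86_64",
        ],
        name="id",
    ),
)

# Combine conditions with & (and) or | (or); the parentheses are required.
mask = (df.arch == "arm64") & (df.version >= 3.8)
selected = df[mask]

# .loc takes a row selector and a column selector.
versions = df.loc[mask, "version"]

# .loc also selects by index label.
by_label = df.loc["cp39-macos_x86_64"]
```

Note that Python's `and`/`or` keywords do not work on arrays or series, which is why the bitwise operators are used here.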

The powerful groupby lets you collect and analyze with ease. For example, to compute the mean for each possible arch:

In [None]:
df.groupby("arch").version.mean()
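
A groupby is not limited to a single statistic. As a rough sketch (again with a hand-built copy of the data), `.agg` can compute several at once, and `.size()` counts the rows per group. Passing `observed=True` is my addition here: for categorical columns, it limits the result to categories that actually appear in the data.

```python
import pandas as pd

# A hand-built copy of the relevant columns, with arch as a category.
df = pd.DataFrame(
    {
        "version": [3.7, 3.8, 3.9, 3.7, 3.8, 3.9],
        "arch": ["arm64"] * 3 + ["x86_64"] * 3,
    }
)
df["arch"] = df["arch"].astype("category")

# Several statistics per group in one call.
summary = df.groupby("arch", observed=True)["version"].agg(["min", "max", "mean"])

# Number of rows in each group.
counts = df.groupby("arch", observed=True).size()
```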

Pandas pioneered a lot of DSL (Domain Specific Language) features for Python, bending Python's syntax to keep things simple and consistent within DataFrames. For example, it provides accessors, like the `.str` accessor, which apply ordinary string methods to every element of a series instead of to a single string:

In [None]:
df.arch.str.upper()
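
The `.str` accessor covers much more than `upper()`. Here is a short sketch on a stand-alone series of identifiers like the ones in our index; the variable names are made up for illustration.

```python
import pandas as pd

ids = pd.Series(["cp37-macos_arm64", "cp38-macos_x86_64"])

# Split each string on "-", then take the first piece of each result.
tags = ids.str.split("-").str[0]

# Element-wise tests return a boolean series, ready to use as a mask.
is_arm = ids.str.endswith("arm64")

# Ordinary string methods are applied to every element.
upper = ids.str.upper()
```

Because `is_arm` is a boolean series, it can be used directly for selection, as in `ids[is_arm]`.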

This is just scratching the surface. Besides manipulating these dataframes and series, Pandas also offers:

* Fantastic date manipulation, including holidays, work weeks, and more
* Great periodic tools, rolling calculations, and more
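
To give a small taste of the date and rolling tools, here is a minimal sketch on one week of invented daily counts. The dates and values are placeholders, and this only touches a tiny part of what is available.

```python
import pandas as pd

# One week of made-up daily values, starting on a Monday.
days = pd.date_range("2024-01-01", periods=7, freq="D")
counts = pd.Series([1, 2, 3, 4, 5, 6, 7], index=days)

# Rolling calculation: mean over a 3-day window (the first two are NaN).
rolling_mean = counts.rolling(window=3).mean()

# Resampling: total per calendar week (weeks end on Sunday by default).
weekly_total = counts.resample("W").sum()

# Work weeks: only Monday through Friday in the same range.
business_days = pd.bdate_range("2024-01-01", "2024-01-07")
```

Note that `bdate_range` only skips weekends by default; handling holidays takes extra configuration.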

Great Pandas code, like vectorized NumPy code, can be a little hard to write and may take a few iterations, but once you have it written, it is easy to read and very expressive.

## More reading

See this notebook that analyzes COVID data and runs daily on my website: <https://iscinumpy.gitlab.io/post/johns-hopkins-covid/>